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Abstract 

In this paper we consider the problem of recovering a high dimensional data matrix from a set of 
incomplete and noisy linear measurements. We introduce a new model that can efficiently restrict the 
degrees of freedom of the problem and is generic enough to find a lot of applications, for instance in 
multichannel signal compressed sensing {e.g. sensor networks, hyperspectral imaging) and compressive 
sparse principal component analysis (s-PCA). We assume data matrices have a simultaneous low-rank 
and joint sparse structure, and we propose a novel approach for efficient compressed sensing (CS) of 
such data. Our CS recovery approach is based on a convex minimization problem that incorporates this 
restrictive structure by jointly regularizing the solutions with their nuclear (trace) norm and ^2/^1 mixed 
norm. Our theoretical analysis uses a new notion of restricted isometry property (RIP) and shows that, for 
sampling schemes satisfying RIP, our approach can stably recover all low-rank and joint-sparse matrices. 
For a certain class of random sampling schemes satisfying a particular concentration bound {e.g. the 
subgaussian ensembles) we derive a lower bound on the number of CS measurements indicating the 
near-optimality of our recovery approach as well as a significant enhancement compared to the state-of- 
the-art. We introduce an iterative algorithm based on proximal calculus in order to solve the joint nuclear 
and ti/ti norms minimization problem and, finally, we illustrate the empirical recovery phase transition 
of this approach by series of numerical experiments. 
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I. Introduction 



Suppose we are given an m x n<i matrix X which is joint-sparse, that is only k -C n\ rows of this 
matrix contain nonzero entries (a similar argument holds for joint-column-sparsity), and in the meantime 
the underlying submatrix of nonzero rows (and equivalently the whole matrix X) has a low rank r <C 
mhx(k, 712). Such matrices have few degrees of freedom: if one knows the indices of those k nonzero rows, 
the corresponding submatrix will only have (k + ri2 — r)r degrees of freedom. Following the recent trend 
in the areas of compressed sensing CD-El , sparse approximation |4), HI and low -rank matrix recovery 
IH-O, we would like to know how many non-adaptive linear measurements of this matrix are sufficient 
to robustly represent the whole data and, more importantly, how this compares to the state-of-the-art. 
The following questions are therefore natural: 

• What are the efficient information/structure-preserving measurements? 

• Is there any computationally tractable algorithm that incorporates the simultaneous low-rank and 
joint-sparse structures into data recovery and does it perform better than the state-of-the-art? 

• And finally, how many measurements do we need to recover an exact low-rank and sparse matrix, 
and is the recovery algorithm stable with respect to noisy measurements and matrices that are 
approximately low-rank or not exactly joint-sparse but compressible? 

This paper attempts to answer these questions. In this regard, we introduce a novel approach for CS 
reconstruction of simultaneous low-rank and joint-sparse matrices. Our approach is based on a convex 
problem that simultaneously uses low-rank and joint-sparsity promoting regularizations, namely, the 
nuclear {trace) norm and the £2/^1 mixed-norm of the solution matrix. As a result, the solution of 
this problem tends to have both desired structures. Further, we propose an efficient algorithm in order to 
solve this optimization problem. 

Additionally, we establish a theoretical upper bound on the reconstruction error guaranteeing the 
stability of our recovery approach against noisy measurements, non-exact sparse and approximately low- 
rank data matrices. We prove that, if the sampling scheme satisfies a specific restricted isometry property 
(RIP), the proposed approach can stably recover all joint-sparse and low-rank matrices. In particular and 
for a certain class of random sampling schemes, we prove that this property implies the following lower 
bound on the number m of CS measurements: 



Compared to the necessary condition for recovery of such structured matrices, i.e. m > (k + 712 — r)r, 
this measurement bound highlights the near-optimality of our recovery approach: its scaling order with 




(1) 
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respect to k and ri2 has only an extra log factor due to the joint-sparsity pattern subset selection. 

A. Application in Multichannel Signal CS 

Multichannel signals are undoubtedly a perfect fit for the new compressed sensing paradigm: on 
one hand, they have numerous applications e.g. in computer vision, biomedical data, sensor networks, 
multispectral imaging and audio processing, just to name a few. On the other hand, the main present 
challenge is the tremendously large and increasing flow of data in those applications, which poses tight 
constraints on the available technologies at any of the acquisition, compression, transmission and data 
analysis steps. Despite their huge dimensionality, multichannel data are often highly redundant as a 
result of two categories of correlations: inter- and mfra-channel correlations. Any efficient dimensionality 
reduction scheme requires comprehensive modeling of both types of correlations. A straightforward 
extension of the classical single-channel CS idea (H, O only considers inter-channel signal correlations 
through an unstructured sparsity assumption on the signal in each channel and the standard recovery 
approaches such as i\ minimization result in limited improvement. 

The simultaneous low-rank and joint-sparse model turns out to be an ideal candidate to more restric- 
tively model multichannel data and the underlying structures. In this case, the matrix X represents a 
multichannel signal where 112 denotes the number of channels and ri\ is the signal dimensionality per 
channel. This model not only considers inter-channel signal correlations leading to a sparse representation, 
but also takes into account intra-channel correlations and is able to assign a particular structure for the 
sparsity : signals of different channels share a unique dominant sparsity pattern (support) and in addition, 
the values of those sparse coefficients are linearly dependent across the multiple channels. Beside its 
restrictiveness, this model is generic enough to hold for a wide range of applications where the entire 
multichannel signal can be decomposed into few sparse sources through a linear source mixture model 
as follows: 

X = SH T , 

where S is an n\ x p matrix whose columns contain p <C min(rai,?i2) source signals that are linearly 
mixed with an 712 x p matrix H in order to generate the whole data across multiple channels. Indeed, 
for a sparse (w.l.o.g. in the canonical basis) source matrix S and an arbitrary mixing matrix H, this 
factorization implies a low-rank (r < p) and joint-sparse X. 

We have discussed in our previous articles |fT0l - lfT2l the application of this model in compressive 
hyperspectral imagery. The linear mixture model typically holds for this type of images where H contains 
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the spectral signatures of a few materials present in the region whose distributions are indicated by the 
source images, i.e. the columns of S. We have shown that leveraging both low-rank and sparse structures 
enables CS to achieve significantly low sampling rates: A hyperspectral image can be subsampled by 
retaining only 3% of its original size and yet be robustly reconstructed. 

Alternatively, the same model can be deployed for a much more efficient CS acquisition in sensor 
network applications and it can eventually enhance the tradeoff between the number ri2 of sensors and 
the complexity (e.g. the number of measurements per channel m/riz) of each sensor. Indeed, typical 
sensor network setups are monitoring highly correlated phenomena which are mainly generated by a 
small number of (sparse) sources distributed in a region, and H represents a possibly unknown Green's 
function accounting for the physical source-to-sensor signal propagation model. 

B. Compressive Sparse PCA 

Despite similarity in names, this paper considers a different problem than the robust principal com- 
ponent analysis (r-PCA) |[T3l , Ifl4l . We are not interested in decomposing a data matrix into separate 
low -rank L and sparse S components, i.e. X = L + S. In contrast, here X simultaneously have both 
structures and our problem is more related to the sparse principal component analysis (s-PCA) literature 
lfT51l - lfT7ll . The s-PCA problem considers the following factorization of a data matrix: 

X m UV T 

where, the columns of V £ M px " 2 contain the few p <C min(ni, 712) principal components (PCs) of X and 
U E M niXp is a sparse matrix of the corresponding loadings of the principal components. The classical 
PCA ifTBll , which is a key tool for multivariate data analysis and dimensionality reduction, numerically 
involves a singular value decomposition with orthogonal matrices U and V. However, the classical 
approach comes with a key drawback: in practical setups with n\ variables and rii observations, most of 
the loadings are nonzero and thus, the main factors explaining data variance are linear combinations of all 
variables, making it often difficult to interpret the results. The idea of s-PCA has been mainly proposed 
to address this shortcoming at the cost of applying more complex algorithms and loosing orthogonality 
of U and V. 

As previously described in Section ll-Al this factorization (i.e. having a small number p of PCs with a 
sparse loading matrix U) implies a low-rank and joint-sparse structure for UV T . Therefore an alternative 
approach to solve s-PCA is to first use our method to approximate X by a simultaneous low-rank and 
joint-sparse matrix, and then apply the conventional PCA to decompose the result into U and V. This 
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approach (i.e. s-PCA in the compressive domain) could be particularly useful for large-dimensional data 
analysis and visualization where data size is the main bottleneck. In addition, and to the best of our 
knowledge, the theoretical analysis provided in this paper is the first in this direction. 

C. Paper Organization 

This paper is organized as follows: in Section HTl we review the most related works in the compressed 
sensing literature and discuss the efficiency of different data priors for CS recovery. Section UlTl presents 
our novel CS reconstruction scheme for simultaneous low-rank and joint-sparse data matrices. In Section 
HVl we establish the theoretical guarantees of this recovery approach. The proofs of the main theorems 
are provided in Section [V] In Section [VT] we introduce a set of iterative algorithms based on the proximal 
splitting method in order to solve the convex recovery problem. Finally, in Section I VII I we show the 
empirical recovery phase transition behavior of our method for different CS acquisition schemes and we 
compare them to prior art. 

D. Notations 

Throughout this paper we frequently use the following notations: 

For the matrix X G M. niXn ' 2 , X^ and X.j respectively denote its ith row and jth column, and [X]ij 
is the element of X in the ith row and the jth column. Moreover, X vec denotes the vector formed by 
stacking the columns of X below one another: 



The vector space of matrices is equipped with an inner product defined for any pair of matrices 

X,X' e R niXn = as follows: 



where, Tr(.) is the trace of a square matrix. Moreover, four matrix norms that we frequently use are: 

1) The spectral norm ||X||: the largest singular value of X. 

2) The nuclear norm \\X\\* = ^ o~i which is the sum of the singular values (cjj) of X. 



X, 



X ;2 



vec — 



X 



{X,X')^Tr{X T X') = Y,[X] l ,AX'] 



3) The Frobenius norm 



X\\ F 4 ^X) = y/Z id [X\l = VZ^- 



4) The £2/^1 mixed-norm 




sum of the £2 norms of the matrix 



rows. 
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Since ||.||* is a norm, it is subadditive i.e., \\X + X'\\* < \\X\\* + \\X'\\* and the equality (additivity) 
holds if X and X' have row and column spaces that are orthogonal X T X' = X(X') T = fl*Q. 
When X is a rank r matrix, we have the following inequalities: 

11*11 < \\X\\ F < ||X||* < v^II^IIf < rpT||. (2) 

In addition, the support of a vector x G M n is defined as the set containing the indices of the nonzero 
elements of x i.e., 

supp(x) = |z G {1 . . . n} : o|. (3) 
This definition can be extended to a matrix X G R n i xri 2 as follows: 

supp(X)^|J su PP( X -j)' W 

which is the union of the supports of all of its columns. We call X a k-joint-sparse matrix when there are 
no more than k rows of X containing nonzero elements, Card(supp(X)) < k, where Card(.) denotes 
the cardinality of a set. Note that, if X is k -joint-sparse the following inequalities hold: 

\\X\\ F <\\X\\ 2 ,i<Vk\\X\\ F . (5) 

Finally, throughout this paper we often use the notation Xf which is a matrix of the same size as X 
whose rows indexed by the set T are the same as X and zero elsewhere. 

II. Compressed Sensing 

Compressed sensing 12, O has been introduced in the past decade as an alternative to Shannon's 
sampling theorem. The original foundations of this new sampling paradigm were built over taking 
advantage of the sparsity of a wide range of natural signals in some properly-chosen orthonormal bases 
and representing large dimensional sparse vectors by very small number of linear measurements, mainly 
proportional to their sparsity levels. 

1) CS Acquisition Protocol: m <^ n\ri2 linear and possibly noisy measurements are collected from 
the data matrix X by defining a linear mapping A : W ll * n °- ->■ R m and its adjoint A* : W 11 -> W niXri2 
as follows: 

y = A{X) + z, (6) 

where y G W n is the vector containing the CS measurements and z G W n is the noise vector due 
to quantization or transmission errors. This formulation equivalently has the following explicit matrix 
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representation: 

y = AX vec + z, (7) 

because any linear mapping A{X) can be expressed in a matrix form A{X) = AX vec , where the inner 
product of the data matrix and each row of A G fl£ mxn i n 2 gj ves a noise-free element of the measurement 
vector. In a similar way, we can express the adjoint operator as 

A*(y) = reshape(^ T y,ni,n 2 ). 

The reshape operator applies on an n±?i2 dimensional vector A T y in order to reshape it into an n\ x 17,2 
matrix such that the first n\ elements fill the first column, the next n\ elements fill the second column 
and so on. 

The main goal of CS is to robustly recover X from the smallest number of measurements m. Despite 
the simplicity of using a linear sampling scheme and unlike Shannon's linear signal reconstruction, CS 
decoding applies more complex and nonlinear approximation methods. In the following we mention some 
of the most relevant approaches to our work: 

2) Unstructured Sparse Recovery by l\ Minimization: The following l\ problem, also known as Basis 
Pursuit De-Noising (BPDN), was among the most successful solutions that have been initially proposed 
for CS recovery of a sparse vector X vec : 

argmin ||X„ ec ||^ subject to \\y - A(X)\\e 2 < e. (8) 
x 

Note that regularizing the solutions by their i\ norm, although known as the best sparsity-inducing convex 
functional, can only exploit unstructured sparsity in the data. It has been shown that if the sampling 
operator A satisfies the restricted isometry property (RIP), then the solution to ([8]) can stably recover 
the underlying data CQ, El- For sampling matrices A (or equivalently A in ©) whose elements are 
drawn independently at random from the Gaussian, Bernoulli or (in general) subgaussian distributions, 
this property implies the following lower bound on the number of measurements: 

m>0(fcn a log(ni/ft)), (9) 

which indicates a suboptimal performance compared to the limited degrees of freedom of a simultaneous 
low-rank and joint-sparse data matrix. 

3) Joint-Sparse Recovery by £2/^1 Minimization: To overcome this limitation, some alternative recov- 
ery approaches propose to consider a restrictive structure on the sparsity pattern of the data, namely, a 
joint-sparsity structure for the nonzero coefficients of X. Some approaches based on greedy heuristics 
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Il20l - ll22l attempt to identify the joint-sparsity support of X, whereas other methods based on convex 
relaxation |[23l - |[25l aim at recovering both the joint-sparsity pattern and the nonzero coefficients by 
penalizing other matrix norms promoting a joint-sparsity structure to their solutions. In particular, pe- 
nalizing the £2/(1 mixed norm of the data matrix through the following convex problem is very popular 
and has been widely used in the literature: 

argmin 1 1 1 1 2 1 subject to \\y — A{X)\\i 2 < e. (10) 
x 

Despite significant improvements in support recoverjo, the joint-sparsity assumption fails to enhance the 
entire data reconstruction {i.e. recovering both the support and the nonzero coefficients). Eldar and Mishali 
||25l extend the notion of RIP to the Block-RIP and show that, if the sampling matrix A satisfies this 
property, the £2/^1 problem can stably reconstruct joint-sparse matrices. For a dense i.i.d. subgaussian 
sampling matrix A, the number of measurements sufficient for satisfying the Block-RIP is 

m > 0(Hog(rai/fc) + kn 2 ). (11) 

As we can see, for a large number n 2 3> k log(m/k), this bound indicates only a minor enhancement of 
a log(ni/fc) factor compared to the measurement bound © corresponding to the classical £\ problem. 
This is a fundamental limitation arising from neglecting the underlying correlations among the values 
of nonzero coefficients i.e., with such a loose assumption (the joint-sparsity model) the data matrix 
can have kn 2 nonzero elements freely taking independent values, and thus generally one requires 0(k) 
measurements per column for recovery. 

4) Low-Rank Recovery by the Nuclear Norm Minimization: Many works have been lately focusing 
on recovering a low-rank data matrix from an incomplete set of observations. In particular, the matrix 
completion problem Q-Q deals with recovering a data matrix given only a small fraction of its entries. 
The key assumption for a stable reconstruction is that the data matrix is highly redundant due to high 
linear dependencies among its columns/rows and therefore it has a low rank. Similarly, Candes and Plan 
ll26l have shown that if the sampling operator A satisfies an extended notion of RIP, called the Rank- 
RIP, then the following nuclear norm minimization problem can stably recover the underlying low-rank 
matrix: 

argmin ||X||* subject to \\y — A(X)\\i 2 < e. (12) 
x 

'it has been shown in [22] that a simple one-stage p-Thresholding algorithm can correctly identify the support of an exact 
joint-sparse matrix provided m > 0(fc log(ni)). 
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In particular, if A is drawn from the i.i.d. subgaussian distribution, then the number of measurements 
sufficient for A to satisfy the Rank-RIP (and hence data recovery) is 



Compared to (fTTI) and for a large number 77,2, we require less number of measurements thanks to properly 
dealing with the high data correlations via nuclear norm regularization. As we can see in (TT3l and 
compared to the bound (fTTI) . the growth of m remains linear by increasing the number 712 of columns, 
but scales as r?i2 instead of kri2 with r <C k. For a large number 112, this bound indicates that 0(r) 
measurements per column are sufficient which is proportional to the number of principal components 
of X rather than the sparsity level of each column 0(/c) in (fTTI) or ©. Despite this improvement, for 
n\ S> ?i2 ignoring sparsity would significantly decrease the efficiency of this approach compared to the 
few degrees of freedom of a low-rank and joint-sparse data matrix. 

5) Rank Aware Joint-Sparse Recovery: Considering those previously described approaches and their 
weak and strong points, one can think of the idea of combining both data priors together and reconstruct 
matrices that are simultaneously low-rank and joint-sparse. Recently a few works lf27l . lf28l have consid- 
ered rank awareness in data recovery from multiple measurement vectors (MMV). More precisely, the 
joint-sparse MMV inverse problem lf20l . ||29l - lf3Tl focuses on recovering a joint-sparse matrix X from 
a set of CS measurements Y G j£ mxn 2 acquired by Y = AX. We refer to such acquisition model as the 
uniform sampling because a unique sampling matrix A G K m /" 2Xni applies on all columns of X. 

Davies and Eldar ll27l proposed a specific rank-aware greedy algorithm, that in case of using a random 
i.i.d. Gaussian sampling matrix A is able to recover (with a high probability) an exact k -joint-sparse and 
rank r X from the noiseless MMV provided 



It can be viewed from (fT4l) that this approach is more suitable for applications with high-rank data 
matrices, where the columns of X are as independent as possible from each other. This is contrary 
to our assumption of having a redundant, low-rank data structure. This limitation comes from the fact 
that the uniform sampling scheme is not generally suitable for acquisition of correlated data matrices. 
Given highly redundant data, such sampling mechanism results in a redundant set of measurements (X is 
low-rank and so becomes Y, because rank(Y) < rank(X)). Think of the extreme example of having a 
joint-sparse rank one matrix whose columns are identical to each other. Using a uniform sampling scheme 
would result in multiple copies of the same measurements for all the columns and thus, applying any 



m > (r(ni + n 2 )) . 



(13) 




(14) 
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joint decoding scheme exploiting the intra-column correlations would be pointless as the measurements 
do not carry any new information across the columns. 

III. Simultaneous Low-rank and Joint-Sparse Matrix Approximation 

Data reconstruction from CS measurements typically reduces to solving an underdetermined and noisy 
linear system of equations such as (O where m <C n\ri2- Without any further assumptions (O would 
admit infinitely many solutions and thus, recovery of the original data matrix is not generally feasible. 
As previously discussed, however, data matrices of interest in this paper have a particular simultaneous 
low-rank and joint-sparse structure restricting data to have very few degrees of freedom, i.e. less than 
both the matrix dimension and the dimension of the underlying sparse coefficients. Our main goal in 
this part is to propose a robust recovery scheme which incorporates properly these constrains in order to 
make © invertible with as few measurements as possible. 

Recovery of a joint-sparse and rank r data matrix from a set of noisy CS measurements can be casted 
as the solution to the following minimization problem: 

argmin Card(supp(A")) (15) 
x 

subject to \\y-A(X)\\ g2 < e, 
rank(X) < r, 

where e is an upper bound on the energy of the noise, ||z||^ 2 < e. Note that finding the sparsest solution 
to a linear system of equations, in general, is an NP-hard problem [32]. Problems such as minimizing 
Card(supp(A")) or the rank minimization are similarly reducing to the cardinality minimization problem 
of finding the sparsest element in an affme space and thus in general, finding solution to a problem like 
(031 ) is computationally intractable. The original solution must be approximated by a polynomial-time 
method. 

A trivial solution could be the cascade of two classical CS recovery schemes: One identifies the joint- 
support set and the other evaluates the rows of the matrix corresponding to the recovered support using 
a low-rank approximation scheme. Unfortunately, the application of this approach is mainly limited to 
the recovery of the exact low-rank and joint-sparse matrices. In practice signals are almost never exactly 
sparse, and using the cascade strategy {e.g. recovering first the dominant A; -joint-sparse part of data, and 
then extracting a rank r matrix from it) for dense data matrices might not be always optimal in order to 
identify the most dominant component having simultaneously both structures. 
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In order to provide robustness against non-exact but approximately low-rank and joint-sparse data, we 
incorporate both data priors into a single recovery formulation. We relax (fl"5T ) by replacing the non-convex 
terms rank(X) and Card(supp(X)) by the convex terms inducing similar structures to their solutions 
i.e., the nuclear norm and the ||X||2,i mixed-norm, correspondingly. We cast the CS recovery 

problem as solving one of the following three convex problems: 

1) Having data nuclear norm as the regularization parameter, or provided with an estimation of it, we 
solve: 

argmin ||-X]|2,l (16) 
x 

subject to \\y- A{X)\\ t2 < e, 
\\ x \\* < T - 

2) Having the £2/(1 mixed-norm of data as the regularization parameter, or provided with an estimation 
if it, we solve: 

argmin ||-X"||* (17) 
x 

subject to \\y- A(X)\\ i2 < e, 
\\ x h,i < 7- 

3) Having a regularization factor A balancing between the ^2/^1 norm and the nuclear norm, we solve: 

argmin ||X|| 2 ,i + A||X||* (18) 
x 

subject to || y — A(X)\\f < e. 

We explain in Section [Vl] that all three problems above can be solved iteratively (in polynomial time) 
using the parallel proximal splitting algorithm (PPXA) li33l . Note that all three formulations (fT6l ). (fTTT ) 
and (fl~8T > contain a regularization parameter (either r, 7 or A) that should be correctly tuned. In the 
next section we show that, for properly-chosen parameters, the solutions of these convex problems can 
robustly approximate the original data provided by very few linear measurements 

IV. Theoretical Guarantees 

In this section we provide the theoretical analysis of the recovery problems (fT6l)-(fT8l. We prove that for 
certain linear mappings A, these three convex relaxations can exactly reconstruct all simultaneous low- 
rank and joint-sparse data from their corresponding CS measurements. In addition, our results guarantee 
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the stability of our approach against noisy measurements and non-exact low-rank and joint-sparse data. 
For a certain class of random linear mappings our results establish a lower bound on the number m of 
CS measurements in terms of the joint-sparsity level k and the rank r of the underlying data matrix. 

A. Sampling Operator RIP 

We define a specific restricted isometry property (RIP) that later helps us to build our main theoretical 
results. 

Definition 1. For positive integers k,r £ N+, a linear map A satisfies the restricted isometry property, 
if for all simultaneous rank r and k-joint-sparse matrices X we have the following inequalities: 

(1 - Sr, k )\\X\\ 2 F < \\A{X)\\l < (1 + S r , k )\\X\\ 2 F . (19) 

The RIP constant 5 rtk is the smallest constant for which the property above holds. 

As we can see and compared to the two former definitions of the Block-RIP |25l and the Rank-RIP 
mi, EoTl . Definition Q] provides a more restrictive property which holds for the intersection set of the 
low-rank and the joint (block) sparse matrices. Verifying whether a linear map satisfies RIP is generally 
combinatorial and thus prohibitive for large size problems. However, for linear mappings drawn at random 
from a certain class of distributions and thanks to their well-established concentration properties, the 
following theorem and corollary guarantee RIP: 

Theorem 1. Let A : M. niXri2 — > R m be a random linear map obeying the following concentration bound 
for any X E andO<t<l, 

Pr ( \\\A(X)\\l - \\Xf F \ > t\\Xf F ) < Cexp(-cm), (20) 

where C and c are fixed constants given t. Then A satisfies RIP with constant 8 r ^ with probability 
greater than 1 — Ce~ K ° m , if the number of measurements is greater than 

m> Ki^k log(ni/k) + (k + l)r + U2r^ , (21) 

where hq and n\ are fixed constants for a given 8 T} k- 

Corollary 1. For a linear map A : M. niXn ' 2 — > IR m corresponding to a matrix A G K mxnin2 whose 
elements are drawn independently at random from the Gaussian A/"(0,l/m), Bernoulli {±l/y/m} or 
subgaussian (with a proper normalization) distributions, RIP holds whenever the number of measurements 
satisfies (1211 ). 
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The proof of Theorem Q] is provided in Section IV-AI and Corollary Q] is a straightforward application 
of Theorem [TJ Indeed, for the Gaussian distribution since ||^4(X)||^ is a chi-squared random variable 
with m degrees of freedom multiplied by ||X||^/m, the following concentration bound holds: 

Pr(|||^(X)||^-||X|||| >t\\X\\ 2 F ) <2exp(-^(t 2 /2-t 3 /3))- (22) 

Similar bounds hold for the Bernoulli and subgaussian ensembles with different constants. Regarding 
Theorem [Q random linear maps drawn from these distributions indeed satisfy the deviation bound (l20l) 
and thus RIP holds for them provided (|2TT) . 

B. Main Reconstruction Error Bounds 

Our main performance bounds here are derived based on the new notion of RIP defined above. Given 
a set of linear measurements that are collected from a matrix X £ ]^™i x ™2 ( w hich is not necessarily 
low-rank or joint-sparse) through the sampling model Q and by using a sampling operator A satisfying 
RIP, the following theorems then bound the approximation error of the recovery problems (fT6l)-(fT8T): 

Theorem 2. Suppose ^6r,2fc < 1/7 and \\z\\t 2 < e. Let X r ^ be any rank r and k-joint-sparse matrix and 
To = supp(X rjfc ). 

(a) For t = \\X ||*, the solution X to (1161 ) obeys the following bound: 

\\x-x\\ F < 4 ( W x - x Johi + II* -foil. \ + ^ (23) 

V Vk V2r / 

Constants k' and are fixed for a given 8 r k- 

(b) For 7 = ||X||2 i, the solution X to (117b obeys the bound (1231) . 

(c) For A = y^r, the solution X to (118b obeys the bound (123b - 

Corollary 2. Suppose S^^k < 1/7 and the measurements are noiseless. Then, all exact (simultaneous) 
k-joint-sparse and rank r matrices X can be exactly recovered by (116b . (117b or (118b (with proper 
regularization parameters) i.e., X = X. 

The proof of Theorem [2] is provided in Section IV-BI The proof of the corollary is trivial by setting 
X T: k = X-ji = X and e = in d23l . Given noisy measurements and a data matrix X that is not exactly 
low-rank or joint-sparse, the reconstruction error (l23l is bounded by two terms. First, it is proportional 
(with a constant factor) to the power of the sampling noise. Second, it is determined by the best rank r 
and fc-joint-sparse approximation of X that minimizes the first term of (l23l . This term can be interpreted 
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as a measure of data compressibility i.e., how accurately data can be approximated by a simultaneous 
low-rank and joint-sparse matrix. 

According to Corollary [T] for sampling matrices A drawn independently at random from the Gaussian, 
Bernoulli or subgaussian distributions (and more generally, all random linear maps A satisfying d20b). 
Theorem [2] guarantees that our proposed reconstruction approach based on joint nuclear and £2/^1 
norms minimization can stably recover all simultaneous rank r and fe-joint-sparse matrices X (with 
an overwhelming probability) provided 

m > 0{k \og{n\/k) + kr + n2r) , 

which, as previously mentioned, indicates a near-optimal recovery performance regarding the degrees of 
freedom of those data matrices i.e., r(k + nz — r). In addition, this bound reveals that our proposed 
recovery approach requires significantly less number of CS measurements to reconstruct such structured 
data compared to the prior art methods (see (TTTb . ( fl3T ) and (fl4l)). Unlike the £2/^1 approach ( fTOb and 
for a large number ri2 of columns, our approach requires 0(r) measurements per column, which is 
proportional to the number of principal components of this data. In addition, compared to the nuclear 
norm minimization scheme ([121 ). we achieve a shaper lower bound on m that is proportional to the joint- 
sparsity level k of the data matrix and scales sub-linearly with the column dimensionality n\. Finally, our 
bound is of a different nature than (fT4T > : the lower the rank, less measurements are required. This reflects 
the importance of a good design for the sampling scheme A along with a recovery approach benefiting 
efficiently from data structures. 

Note that Theorem [2]precisely characterizes the proper values for the regularization parameters in (fTol - 
(|T8T >- If there is no prior knowledge available about the nuclear norm or the £2/^1 norm of the underlying 
data (and a blind tuning might be challenging), (fT8l ) brings more flexibility to the reconstruction process. 
For a given A, Theorem [2]-c guarantees that the recovery error is bounded by the minimum of (|23T ) for 
the best pair r, k satisfying both RIP (i.e. ^6r,2fc < 1/7) and A = y/k/2r. 

V. Proofs 

A. Proof of Theorem \J\ 

Let Xj- denote the k x n-i submatrix formed by selecting the rows of X indexed by a set T C 
{1, . . . , ni} with the cardinality k. Correspondingly, for a given T we define the restricted map At ■ 
ffi>fcxn 2 — > W" 1 which applies on Xq-, and is related to A through its explicit matrix formulation (O as 



November 22, 2012 



DRAFT 



15 



follows: 

A T (X T ) = A T (Xr)vec, 

where, Aj- G ^nxkn 2 j s t jj e su bmatrix containing the columns of A indexed by T ■ If we assume X is an 
exact k -joint-sparse matrix whose rows are supported over the set T and zero elsewhere, we will have 

A{X)=A T (X T ). 



In this case and according to our assumption (i.e., A satisfies (1201 )) we can write a similar concentration 
bound for a given T and Xj-\ 

\\A T (X T )\\l - \\X T \\l > t\\X T \\ 2 F ) < Cexp(-cm). (24) 



Pr 

According to the Rank-RIP property defined in ||26l , if a linear map At satisfies the bound (1241 then, 
for a given T and all rank r realizations of Xf, we have 

\\Ar(X r )\\l- \\X T \\% >S\\X T \\ 2 F ) <Cexp((k + n 2 + l)rlog(36V2/5)-cm), (25) 



Pr 

where < 5 < 1 is a fixed constant. The proof of this statement mainly uses a covering number argument 



for low-rank matrices and it can be found in E61 (see theorem 2.3). Now, by considering all possible 
combinations for the subset T i.e., ('I 1 ) < (jre) k and applying a union bound on (1251 ). we can deduce 
(fT9l ) for all k -joint-sparse and rank r matrices X with probability greater than 

1 — C exp (k + k log(ni//c) + (k + n 2 + l)r log(36\/2/5) — cmj . 

Equivalently, with probability exceeding 1 — Ce~ K " m , A satisfies RIP with constant 5 r ^ < S, whenever 

m > Ki(k\og{ni/k) + (k + n 2 + l)r^, 

where kq and ki are fixed constants for a given 5 i.e., kq = c— log(36\/2/5)/Ki, and k\ > log(36\/2/^)/c. 

B. Proof of Theorem [2] 

In this part we first provide the proof of Theorem [3J-a and later, we use similar techniques to prove 
Theorems [2]-b and [2]-c. For this purpose, we prove the following two lemmas that are the key elements 
for the proof of Theorem [2] 

An important implication of the RIP is that for linear maps A having small RIP constants, the mea- 
surement outcomes of any two A; -joint-sparse and rank r orthogonal matrices become nearly orthogonal. 
This statement is precisely characterized by the following lemma: 
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Lemma 1. \/X,X' such that, rank(X) < r, rank(X') < r', Card(supp(X)) < k, Card(supp(X')) < 
k' and (X, X') = 0, we have: 

\(A(X),A(X'))\ < 6 r+r , >k+k ,\\X\\ F \\X'\\ F (26) 

Proof: Without loss of generality assume X and X' have unit Frobenius norms. Since X ± X' is at 
most k + k! joint-sparse and rank(X ± X') < r + r', we can write 

)\\X ± X'f F < \\A(X ± X')\\l < (1 + <Ww)ll* ± 

Moreover, X and X' are orthogonal to each other which implies that ||X±X'|||i = + = 2. 

Now, by using the parallelogram identity we can deduce: 



| (A(X),A(X')) | = \\\\A(X + X')\\l - \\A(X - X')\\l 



^ r 4-r r 



fc+fc' 



which completes the proof. ■ 

Lemma 2. Let A and B be matrices of the same size and rank(yl) < r, supp(^4) = T. 3Bi,E>2 
( matrices of the same size as A and B ) such that: 

1) B = B 1 + B 2 , 

2) supp(Bi) = T, rank( J Bi) < 2r, 

3) A T B 2 = 0, A(B 2 ) T = 0, 

4) {Bx,B 2 )=0. 

Proof: Suppose Aj- and Bj- are submatrices correspondingly containing the rows of A and B indexed 
by T ■ Recht et al., (see Lemma 3.4 in fT9l ) propose a decomposition Bj- = B\ + B 2 so that 
. rank(.Bi) < 2rank(2) < 2r, 
. {A T ) T B 2 = 0, A T (B 2 ) T = 0, 
. (5i,S 2 ) = 0. 

On the other hand, lets decompose B = Bj- + B-j-<= where Bj- and B-j-o are respectively the same 
as B on the index sets T and T c (the set complement of T) and zero elsewhere. Suppose B\ T and 
B 2t are matrices of the same size as B, whose rows indexed by T are respectively populated by the 
submatrices B\ and B 2 , and the remaining elements are zero. We can write Bj- = B\ T + B 2t . Now, by 
letting B\ = B\ T and B 2 = B 2t + B-j-<= one can easily verifies that B\ and B 2 satisfy properties (l)-(4). 
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1 ) Decomposition of The Error Term: Here, by using Lemma [2] we decompose the error term H = 
X — X into the summation of rank 2r and k -joint-sparse matrices that are mutually orthogonal to each 
other. This decomposition later appears to be essential to bound the error using Lemma Q] 

Let us fix the notation so that Xj- E M n i xn 2 j s a joint-sparse matrix which is identical to X for the 
rows indexed by T, and zero elsewhere. Let 7o denote the joint support of X r f. i.e., To = supp(X r ^) 
with Card(7o) < k. Since X r ^ is additionally a rank r matrix, we can use Lemma [2] to decompose H 
into orthogonal matrices H = Hf + H such that, 

. rank(#f o ) < 2r, 

. (Hl,H)=0, 

• (Xr,k) H = X r ^H = 0. 

Further, we decompose H into the summation of disjoint /c -joint-sparse matrices. For this purpose we 
write H = Hj- + Hj-e, where T§ denotes the set complement of the support 7o- Now we decompose 
Hfc into the following terms: 

i>\ 

where T\ is the set containing the indices of the k rows of Hfc with the largest £2 norm, 72 contains 
the indices of the next k high energy rows and so on. By construction Hf t s are supported on disjoint 
sets. We simplify the notations by using Hi instead of Hf. 

Next, we decompose each Hi into the summation of rank 2r matrices: 

Hi = ^Hij. 

i>i 

We use here a similar decomposition proposed in ll26l : let Hi = UT,V T be the ordered singular value 
decomposition of Hi i.e., diag(S) is the ordered sequence of the singular values (E^j > £j + x,i+i)- F° r 
integers j > 1, define the index sets Jj := {2r(j — 1) + 1, 2rj}. Then, the decomposition terms will 
be 

: r. / v,n; / )'. 

i.e. Hi i corresponds to the first 2r largest singular values of Hi, Hi ^ corresponds to the next 2r largest 
singular values of Hi and so on. 

Those two steps suggest a decomposition H = Yli>o j>x^-i,h wnose terms H%j are k -joint-sparse 
and rank 2r matrices mutually orthogonal to each other i.e., V(i,j) 7^ (i',f) 

(Hi j, H V j>) = 0. 
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Additionally, Hj- is orthogonal to all Hij as a result of either having disjoint supports (i.e., for i > 1), 
or by the specific constructions of Hj- (see Lemma |2] in this paper, and Lemma 3.4 in |[T9l ) and Non- 
standard computations show that the following inequality holds for all i: 



v 2r 



Also, we have 

l-H^Iki 

i>2 

Inequalities (|27T ) and (|28T ) are respectively the consequences of 

\\Hij\\ F < (2r)- 1 / 2 ||ff. j _ 1 ||, 

and for i > 1 

||^i||F<fc~ 1/2 ||^i-l||2,l- 

As a result, the following inequalities hold: 

S- ^ II^j'IIf 
i>o,i>i 

(i,i)^(l,l),(0,l) 



< ^ • (28) 



< 5^||ffi||F+ J] II^jHf (29) 

i>2 i>0,j>2 



< ^H^Iki + ^Ell^H* (30) 



i>0 



^7= k llHvh - 1 + Wr m - <31> 



2) Proof of the Main Error Bounds: We start with the proof of TheoremEJ-a. Since A(X) is corrupted 
by a noise whose £2 norm is assumed to be bounded by e, and X is a feasible point for (fT6l ) we can 
write, 

P(tf)||, 2 = P(*-*)k 

< |||/ - A(X)\\ i2 + \\y - A(X)\U 2 < 2e. (32) 
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By assumption \\X\\* = r, and since X is a feasible point of (fl6l ) satisfying \\X\\* < r we have 

^ 11^ + -^11* 

> ||^r,fc + ~ ||-^r,fc||* ~~ l|-^77>ll* 

= ||-^r,fc||* + 11-^11* ~~ ll^r,fc||* ~~ 11-^77,11*1 (33) 

which yields the following inequality: 

\\H ||* < \\Hj- o II* + 2||X r) jt||*. (34) 



Note that equality (1331) holds because both row and column spaces of X r ^ and H are orthogonal to each 
other. On the other hand, as a result of the minimization (fl6l ) we have 

ll^lb.i > \\X + -ff||2,i 

= \\ X T + H T ( ,h,l + \\X T J + #7? II 2,1 

> \\X% Ib.i - ll-ffrolb.i - ||^r c ll2,i + ll-f^Hki- 

Consequently, we can deduce: 

11^73=112,1 < H^Tolb,! + 2||X To e|| 2il . (35) 
Note that Hjg = Hfc. Now, dividing both sides of (|35T ) by \f~k and using (|5]) yields: 

^=P?73>||2,1 < ll#T ||F + 2e 

^ 11-^77, Hf + ||-Ho,i ||f + II-^ojIIf + 2e 

i>2 

< ||Ff o || F + ||F ,i||ir + ^=||F||* + 2e (36) 

V2r 



< 2||^J F + ||i?o,i||F + 2e' + 2e, (37) 

where e = ^^"i 2,1 and e' = ^ X j£p* ■ Inequality (l36l ) is the consequence of (UTT l. and for deriving (I37T ) 
we use (O and ©. 

Using (|37T ) and d34l , we can continue to upper bound the summation of the Frobenius norms in ( f3TT ) 
as follows: 

S < 3\\H^ o \\ F + \\H 0tl \\ F + Ae' + 2e. 

<3V2\\Hl + H 0>1 \\ F + e", (38) 

November 22, 2012 DRAFT 



20 



where e" = 4e' + 2e. Note that for the last inequality we use 

11-^78 ll-F + II-^0,i||f < V2\\Ht + ^0,l||F) 
which holds because Hj- and #o,i are orthogonal to each other. 



Meanwhile, by such specific error decomposition we have 

rank(#£ o +#o,i+#i,i)<6r, 

supp(#^ +# ,i + #1,1) < 2k ' 
and thus according to Definition Q] we can write 



;i - Ser,2k)\\H^ o + # ,i + H ltl \\ 2 F < \\A{H° To + # ,i + #1,1) ||| 



On the other hand, 



\\A(Hl + #0,1 + #1,1) 



(A(Hl + #0,1 + H 1A ),A(H - #)) 



where, 



# 



E 



Hi j. 



i>0,j>l 
(tj)^(l,l),(0,l) 



Using the Cauchy-Schwarz inequality and (1321 yields: 

(^(^+#0,1 + ^1,1),^)) 

<|Kijf o + # 0l i + #i,i)lk P(#)lb 



< 2e y/l + 56r,2fc H-H75 + #0,1 + #1,1 ||f- 



(39) 



(40) 



Moreover, because Hj- + #0.1 is at most a rank 4r, fc -joint-sparse matrix orthogonal to the summation 
terms in #, and so is #11 (a rank 2r and /c -joint-sparse matrix), we can use Lemma Q] to write 

*>o,i>i 

(i,i)^(l,l),(0,l) 



(A(H^ + H 0A + H hl ),A(H)) <[S 

6r,2fc||#T + -^0,1 1| F + ^4r,2fc||#l,l||F ) X 



< y/2 56r,2fc||#7J, + #0,1 + #l,l||F x S 

< 6^6r,2fc||#73 + #0,1 + #l,l||f 1 + V^e" #6r,2fc||#7^ + #0,1 + #1,i||f- 

(41) 
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The second inequality is the outcome of the orthogonality of Hj- + Hq : i and H\ t i, and replacing the 
sum of the Frobenius norms by its upper bound (I38T ) gives the last inequality. Inequalities (l40b and ((4T|> 
are providing an upper bound on \\A(Hj- + H i + Hi ,i)||f : 

M(flr + #0,1 + #l,l)||! 2 < 6<y 6 r,2fe||flro +^o,i + #i,i||| 

+ (\/2 e" 5 6 r,2fe + 2e V 1 + 5 4r,2fe) ||#f + #o,i + #i,i||f- 
Now by using the lower bound derived in d39l ), we can deduce the following inequality: 

||^+Fo,i+Fi, 1 || F <ae" + /?e, (42) 

where, 



1 — 7<56r,2fc 

and 



„ _ 2 \A + ^6r,2fc 
1 - 7(56r,2fc 



Finally, we upper bound the whole error term H as follows: 

\\H\\f < 11-^77, + #o,i + #i,i||f + 5 

< (3\/2 + i)||fif o +^ 0jl+ ff ljl || Ji , + e / ' (43) 



Inequality d43l uses the bound derived in d38l ). Replacing \\Hj- + i?o,i + #i,i||f by its upper bound 
in (1421 results in the last line where the constants k' = 4 + 4(3\/2 + l)a and k\ = (3\/2 + are fixed 
for a given RIP constant 5e r ,2k < 1/7. This completes the proof of Theorem [2]-a. 



Proof of Theorem |2l-b: we follow the same steps as for Theorem |2]-a. Here, inequalities (l33l and (l34l 
are the outcome of the nuclear norm minimization in (fTTT ) i.e., || X||* > || + and (I35T ) holds since 
7 = ||^||2,i > H-X" + #||2,i- Thus, one can easily verifies that all bounds (|27T)- (|441 are valid and the 
solution of (fTTT ) obeys the error bound in (|23l . 

Proof of Theorem 121-c: it is also similar to Theorem |2]-a and almost all the previous bounds hold. We 
just need to develop a new bound on -7? || #77*11 2 ,1 + "75=l|#l|* inequality (I3TT) . From minimization 
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<f]~8T > we have 

ll^lb.i + ^11-^11* 

> \\X + H\\ 2 ,i + \\\X + H\\* 

> \\ x To h,i - \\Hr a h,i - ll^-Tylki + II#t c I|2,i 

+ A + — ||-Xr,fc||* ~~ 11-^73 II*) • 

Thus, we can write the following bound: 

11^112,1 + A||#||* < ||fl"roll2 > i + 2||^7?lki + A||i4j* + 2A||X r , fe ||*. 
Dividing both sides by Vk yields: 

-Lll^lb,. + j-m. < ll^rJk + Ayf 1KB, + 2, 4- 2Ayf i 

< \2 + \^pj ||iJ^|| P + ||H , „, 1 || P + 2e + 2(l + Ay^Je'. 
Note that the last inequality uses the bound 

\\H % \\f < 2\\H^\\ F + \\H 0!l \\ F + 2e', 

which has been previously developed for obtaining d37l ). Now, by letting A = we achieve the 

following upper bound: 

-L||ff^|| 2il + -^ll^ll* < 3||^||f + ||#o,l \\f + 4e' + 2e, 



which gives the same bound as in (1381) for sum of the Frobenius norms. All other steps and inequalities 
that are derived for the proof of Theorem 12-a are valid as well for Theorem 12-c. Therefore, solving (fT8l) 



with A = J£ leads to the same error bound as in d23l . 

VI. The Parallel Proximal Algorithm 

Problems (fToT)-([T8T) are nonlinearly constrained non-differentiable convex problems. There are numerous 
methods for solving problems of this kind. We discuss here our approach based on the Parallel Proximal 
Algorithm (PPXA) proposed in [33 ]. A remarkable advantage of using this method is the ability of having 
a parallel implementation. In a nutshell, PPXA is an iterative method for minimizing an arbitrary finite 
sum of lower semi-continuous (l.s.c.) convex functions, and each iteration consists of computing the 
proximity operators of all functions (which can be done in parallel), averaging their results and updating 
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Algorithm 1: The Parallel Proximal Algorithm to solve (fT6l)-(IT8l) 



Input: y, A, the regularization parameters (e and r, 7 or A), the convergence parameter j3 > 0. 
Initializations: 

n = o,x = r 1)0 = r 2i0 = r 3 , e R niXn2 

repeat 

for (i = 1 : 3) do 



-Pj,n — P rox 3l3f i (^i,n) 

end 

= (-Pl,n + -P2,n + -P3,n)/3 

for (t = 1 : 3) do 
end 

until convergence; 



the solution until the convergence. The proximity operator of a l.s.c. function f(x) : R n — > R is defined 

as proxy : M n -)■ R n : 

axgrnin /(x) + - x\\j 2 , (45) 

which may have an analytical expression or can be solved iteratively. 

In this part we rewrite each of the problems (fT6l)-(fT7T) as a minimization of the sum of three l.s.c. 
convex functions 

argmin / 1 (X) + / 2 (X)+/ 3 (X). (46) 

X£l"l x "2 

The template of the PPXA algorithm that solves this minimization is demonstrated in Algorithm Q] In the 
following we define those functions /j and we derive their corresponding proximity operators. Note that 
because /j is defined on R niXri2 , the definition of the proximity operator in (1431 ) is naturally extended 
for matrices by replacing the £2 norm with the Frobenius norm. Finally, f3 > (in Algorithm [U is a 
parameter controling the speed of convergence. 

A. Constrained £2/^1 Norm Minimization 

We can rewrite the minimization problem (fT6l ) in the form (l46l ) where 

h{X) = \\X\\ 2 ,i , f 2 (X) = i Bt2 (X), and f 3 (X) = i B ,(X). 
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The convex sets Bt> 2 , £>* C l" lX " 2 correspond to the matrices satisfying the constraints \\y — A(X)\\ i < e 
and \\X\\* < r, respectively. The indicator function ic of a convex set C is denned as follows: 



if X e C 



prox afl (X)) = n ' y 2 |L J+ X h . (47) 



L +oo otherwise 
Standard calculations show that Vi G {1, . . . , rti} 

l-Xi,. [U P - a) 

which is the soft thresholding operator applied on the rows of X to shrink their £2 norms. By definition, 
the proximal operator of an indicator function ic(X) is the the orthogonal projection of X onto the 
corresponding convex set C. Thus, prox a j- 2 (X) and prox a ^ (X) are respectively the orthogonal projection 
of X onto the sets Be 2 and £>* . For a general implicit operator A, the projector onto Be 2 can be computed 
using an iterative forward-backward scheme as proposed in ||34l . However, once A corresponds to a tight 
frame (i.e. Vy € R m A(A*(y)) = vy for a constant v) then according to the semi-orthogonal linear 
transform property of proximity operators ll33l . the orthogonal projection onto Bi 2 has the following 
explicit formulation: 

prox af2 (X) = X + v- 1 A*(T)(l-e\\r\\l 1 ) + (48) 

where, r = y — A(X). 

Finally, let X = U'EV T be the singular value decomposition of X with S = diag(<7i, 02 ■ ■ ■), then the 
projection of X onto B* corresponds to the projection of its singular values onto the following positive 
simplex (£1 ball): 

Bit = {0-1,0-2, ...,£t + :^(7 t <r}. 



More precisely, we have 



prox af (X) = £/diag(o-i,...,cr r )y T , 



where \a\, a-i, ■ ■ ■] = Vb^ ([f"i> 02, • • •]) an d Vb^ denotes the orthogonal projection of the singular 
values onto the set Be 1 ■ This projection can be computed within few low-complex iterations using the £\ 
projection Matlab implementation of the SPGL1 software package ll35l . 

B. Constrained Nuclear Norm Minimization 
We rewrite the minimization ( fTTl ) as d46l) where 

h(X) = \\X\U , MX) = i Be2 (X), and f 3 (X) = %b^{X). 
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The function fi and its proximity operator are the same as in the previous part. The convex set B^/^ 
corresponds to the matrices satisfying 1 1 1 1 2,1 < 7, and prox a f (X) is the orthogonal projection of X 
onto B i3 /g$ 

The proximity operator of the nuclear norm is the soft thresholding shrinkage operator applying on 
the singular values of X: 

pwx afi (X) = C/diag(a 1) a 2 , . . .)V T (49) 

where, Vi <7j = (cr* — a),. 

C. Constrained, Nuclear norm plus £2/^1 Norm Minimization 

Here, we rephrase problem ([T8T ) as (l46l ) where the functions in the summation are 

= ||X|| 2>1 , f 2 {X) = i B jX), and f 3 (X) = \\\X\\.. 

According to our previous descriptions, prox a ^ is the orthogonal projection onto the set Be 2 (X), prox a ^ 
is the singular value soft thresholding as in (l49l and prox a j- i is equivalent to the soft thresholding of the 
£2 norms of the rows of X as in (|47T ). 

Note that, for all three above-mentioned implementations using the "economized" singular value 
decomposition is very useful as the data matrix is assumed to be low-rank and thus it avoids a huge 
computation at each iteration (particularly for a high-dimensional data)j^| Since the rank r of the solution 
at each iteration is unknown, at the first iteration of Algorithm [T] r is initially set to some value and later, 
incremented by doubling its value until achieving the zero singular value in IVI-A1 or the singular value 
soft thresholding parameter a in IVI-BI and IVI-CI For the next iterations, r is initialized by the value of 
the last iteration and so on. This idea has been used in the SVT software package |[36l for the singular 
values soft thresholding and our experiments indicates that in the early iterations of Algorithm Q] and 
within at most 2 or 3 sub-iterations, r is set to its correct value, and remains fixed for the forthcoming 
iterations. 

2 The Matlab code to perform such projection can be found in the SPGL1 package and is a natural extension of the vector 
projection onto the l\ ball. 

3 The "economized" singular value decomposition of a rank r matrix, here, refers to computing only the r largest (i.e. nonzero) 
singular values. 
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VII. Numerical Experiments 

In this section we conduct a series of experiments in order to demonstrate the average behavior of 
the proposed simultaneous low -rank and joint-sparse recovery approach for different problem setups, i.e. 
different matrix sizes n\ x 77.2, joint-sparsity levels k, ranks r and number of CS measurements m. Among 
the three formulations suggested in this paper, we choose the nuclear norm plus the £2/^1 mixed-norm 
minimization scheme in (fT8l) (referred to as Nuclear-^/^i) as it requires the minimum prior knowledge 
about the underlying data, and the regularization parameter is set to A = \ as suggested by Theorem 
12-c. A rank r and /c-joint sparse matrix X S M niX ™ 2 is generated by multiplying two independent i.i.d. 
Gaussian matrices X\ G M fcxr and X2 € R rxn2 , and populating uniformly at random k rows of X 
by the rows of XiX 2 . The figures that are used for performance evaluation demonstrate the empirical 
reconstruction error phase transition: for a certain problem setup (m, n 2 , k, r, m), the normalized recon- 
struction error \\X — X\\f/\\X\\f has been averaged over twenty independent random realizations of X 
and A. Figures are plotted in gray scale and white pixels correspond to high reconstruction errors while 
black pixels are indicating the recovery regions. 

Figure 1 1 (a) demonstrates the performance of our proposed scheme at recovering 40 x 40 random 
matrices of rank r = 2, and for different joint-sparsity ratios (^-) and subsampling factors (^^)- A 
dense i.i.d. random Gaussian sampling scheme A has been applied for compressed sampling of data. We 
can observe that for recovering data matrices with high joint-sparsity ratios more CS measurements are 
required. Meanwhile, it turns out that Nuclear-^ 2 /£i is able to recover even dense (non-sparse) data 
matrices {i.e., k = n{) subsampled by retaining less than 40% of their original sizes. This should not 
come as a surprise : despite loosing sparsity, Nuclear-^/^i still takes advantage of data correlations 
due to the low-rank structure, which ultimately results in a successful reconstruction for an oversampling 
factor nearly four times the degrees of freedom of the original data. 

Figure |l(b)| illustrates the error phase transition behavior of the £2 / 1\ minimization ( fTOb for the same 
setup. As we expect, for such highly structured data, our proposed scheme significantly outperforms the 
£2/^1 approach by having a much wider recovery region. 

Figures [T (c)| and [T(d)1 repeat similar experiments but for rank r = 4 matrices. The recovery region of the 
joint Nuclear-^Ar norm minimization is narrowed compared to the Figure [1(a)] as more measurements 



are required to compensate higher degrees of freedom of data in this scenario. However, this method still 
keeps outperforming the £2/^1 recovery approach. 
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Subsampling ratio (m /n^] Subsampling ratio (m fnj\^ 

(c) Nuclear-^2/^1, t = 4 (d) £2/^1 minimization, r = 4 

Fig. 1. Reconstruction error phase transitions for 40 x 40 matrices of rank r acquired by a dense i.i.d. Gaussian sampling 
matrix: Sparsity ratio (fc/ni) vs. Subsampling ratio ( „ 7 ^ 2 )■ 

Reconstruction Using Distributed Sampling Schemes 

In another experimental setup, we demonstrate the influence of different sampling schemes on our 
recovery approach. Besides dense sampling, corresponding to the use of dense i.i.d. random Gaussian 
matrices A in (O, we use the two following schemes particularly suited for distributed sensing applica- 
tions: 

• Distributed Uniform Sampling: In this case all columns of the data matrix are sampled by a unique 
i.i.d. Gaussian random matrix A G M mxni and thus the sampling operator A corresponds to a block 
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Fig. 2. Reconstruction error phase transitions for 30 x 112 matrices having r — 3 and k — 10: Number of columns (712) vs. 
Subsampling ratio (;^-). 



diagonal matrix given below: 



.4 



A 
A 



(50) 



... A_ 

where rh = m/n 2 denotes the number of measurements collected per column. As previously 
mentioned in Section 111-51 this sensing model is analogous to the multiple measurement vectors 
(MMV) model. For multichannel signal applications such as sensor networks where each column of 
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the data matrix represents the observation of a sensor node, this scheme enables all nodes to use a 
unified mechanism for sampling their own local observations in a distributed manner (i.e. sampling 
without intra-channel collaboration). 

Distributed Independent Sampling: In this case, an independent random Gaussian matrix Aj £ fl£ mxri i 
is used to collect fh samples from each column j of the data matrix, and thus the sampling operator 
corresponds to a block diagonal matrix with distinct blocks: 



A 



Ax 
A 2 










.4 



"2 



(51) 



where fh = m/n2- For multichannel sensing applications with strong intra channel correlations, this 
sampling strategy gives the freedom of designing the blocks so that the measurements of different 
channels carry diverse and non-redundant information to the decoder. 



Figures |2(a)P(d)| demonstrate the reconstruction error phase transitions for various numbers of columns 
n 2 and CS measurements m. Data matrices of size 30 x n 2 with a fixed rank r = 3 and joint-sparsity 
level k = 10 are drawn at random and subsampled by dense, distributed independent or distributed 
uniform random sampling matrices. These graphs suggest a useful tradeoff for practical multichannel 
sensing setups, e.g. in sensor network applications, in order to design low cost/complex sensors but 
distribute many of them (high number of channels n 2 ). As we can see in Figure [2(a)] the recovery region 
of Nuclear-^/i'i using dense sampling matrices follows the behavior predicted by the theoretical 
analysis (see Corollaries Q] and [2]) i.e., 



m 



>0 



ni n 2 



k \og(n\/k) + kr r 
ni n 2 ni 



that is a constant plus a decaying term. A very similar recovery performance can be observed in Figure 
|2(b)| using distributed independent sampling. In both cases by increasing n 2 , Nuclear-^/^i requires 
less CS measurements per column (up to a minimal measurement limit). 

Figure |2(c)| demonstrates the performance of the t 2 / 1\ recovery scheme using dense random sampling 
matrices. Compared to Nuclear-^i, the recovery region shrinks and more remarkably for high values 
of n 2 , the minimal measurement limit becomes larger i.e. more measurements per column is required. 
As a result, for multichannel sensing applications Nuclear-^/^i achieves better tradeoff between the 
number of sensor nodes and their complexity compared to t 2 jt\ recovery approach. 

Finally, we repeat the same experiment for the Nuclear-^i approach and using distributed uniform 



random sampling matrices. We can observe in Figure 2(d) that by adding extra channels recovery does 
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not improve much, which is consistent with our early explanations: The uniform sampling is not a proper 
strategy for CS acquisition of low-rank data matrices because the resulting measurements become highly 
redundant and they do not carry diverse information across channels. 

VIII. Conclusion 

In this paper we showed that combining the two key priors in the fields of compressive sampling 
and matrix completion, namely joint-sparsity and low-rank structures, can tremendously enhance CS 
dimensionality reduction performance. We proposed a joint Nuclear-^/^i convex minimization ap- 
proach that is able to efficiently exploit such restrictive structure. Our theoretical analysis indicates that 
the Nuclear-^/^i scheme is robust against measurement noise and non-exact low-rank and joint- 
sparse data matrices. It can achieve a tight lower bound on the number of CS measurements improving 
significantly the state-of-the-art. We provided algorithms to solve this type of minimization problem 
enabling us to validate these improvements by several numerical experiments. 

Extension of this work includes theoretical analysis of the Nuclear-^/^i approach for structured 
random matrices (e.g., distributed independent random sampling matrices), as well as considering alterna- 
tive structures such as simultaneous low-rank, sparse (rather than joint-sparse) and positive semidefinite 
matrices which may find applications in high-dimensional sparse covariance matrix estimation. We are 
also interested in further applications of this approach in multichannel signal compressive sampling. 
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